A Basic Language Resource Kit for Persian
Abstract
Persian, with about 100 million speakers worldwide, belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Our goal is therefore to develop open source resources, such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC), a modified version of the Bijankhan corpus with added sentence segmentation and consistent tokenization, adapted to be better suited for syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is being derived from UPEC with an annotation scheme based on Stanford Typed Dependencies; it is planned to consist of 10,000 sentences, of which 215 have already been annotated.
Similar resources
The First Parallel Multilingual Corpus of Persian: Toward a Persian BLARK
In this article, we introduce the first parallel corpus of Persian with more than 10 European languages. The article describes the primary steps toward preparing a Basic Language Resources Kit (BLARK) for Persian. Up to now, we have proposed a morphosyntactic specification of Persian based on the EAGLES/MULTEXT guidelines and specific resources of MULTEXT-East. The article introduces Persian ...
Persian in MULTEXT-East Framework
Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. It is an Indo-European agglutinating language, written in Arabic script. This paper presents the first step in creating a Farsi basic language resources kit. This step comprises the specifications for morphosyntactic encoding, which is based on the EAGLES/MULTEXT mod...
A BLARK extension for temporal annotation mining
The Basic Language Resource Kit (BLARK) proposed by Krauwer is designed for the creation of initial textual resources. There are a number of toolkits for the development of spoken language resources and systems, but tools for second level resources, that is, resources which are the result of processing primary level speech resources such as speech recordings. Typically, processing of this kind ...
The Effects of Bilingualism on Basic Color Terms in Persian
This study aims to determine how bilingualism could influence the list of Persian basic color terms and their order. Using samples of monolingual Persian and bilingual Kurdish students and a color list task, it is assumed that bilingualism could change the ordering of the non-basic color terms in the second language, but not the basic ones. Another assumption is that the old usual methods for obta...
European Language Resources Association History and Recent developments
This paper aims at describing the rationale behind the foundation of the European Language Resources Association (ELRA) in 1995 and its activities since then, with a particular focus on Language Resources and Human Language Technologies Evaluation activities. The main message is the promotion of a concept of Basic Language Resource Kit that should be available for all languages in order to suppor...